require(tidyverse)
require(knitr)
require(DiagrammeR)
require(geosphere)
require(ade4)
require(adespatial)
require(vcfR)
require(plotly)
require(vegan)
require(adegenet)

Readme

This is an RStudio project. If you’d like to view pre-rendered figures, read a summary of the analyses, and see the code, please open the relevant html file in a browser.

Rationale

We use ddRADseq data from 12 sampling locations across the Pacific Ocean to develop a GTseq marker panel for genetic stock identification and analysis of population genetic structure in Pacific Albacore.

Previous analyses used STRUCTURE and PCA to reveal population genetic structure separating North from South Pacific sampling locations, and outlier approaches identified a set of SNP markers that are strongly differentiated between these two groups. However, no genetic structure was identified at smaller spatial scales using these tools, despite some potential evidence of restricted dispersal at a sub-gyre scale from stable isotope analysis and region-specific growth rates and sizes.

Given the large effective population size (Ne) of Albacore, we suggest caution in interpreting the PCA and STRUCTURE results as evidence of panmixia within North and South Pacific Albacore populations. Fst at sub-gyre spatial scales was substantially below the threshold at which these tools can infer genetic structure, and large Ne can result in low Fst despite substantial demographic independence among demes.
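To make the Ne/Fst point concrete: under an island model at migration-drift equilibrium, expected Fst is approximately 1/(4Nem + 1), so when Ne is very large even tiny migration rates drive Fst toward zero while demes remain demographically independent. A minimal sketch with hypothetical parameter values:

```r
# island-model expectation: Fst ~ 1 / (4*Ne*m + 1)
# Ne and m are hypothetical, for illustration only
Ne <- 1e6                      # effective size per deme
m  <- c(1e-6, 1e-5, 1e-4)      # per-generation migration rates
fst <- 1 / (4 * Ne * m + 1)
round(fst, 4)
# with Ne = 1e6, even m = 1e-4 (Nem = 100 migrants) gives Fst ~ 0.0025,
# well below what PCA/STRUCTURE can resolve
```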

In this notebook, we select features for a future GTseq panel after fully investigating potential genetic structure across a diversity of spatial scales available in the sampled locations. Identification of potential fine scale genetic structure is accomplished through:

  1. Hierarchical Structure: While pairwise Fst values between sampling locales within the North Pacific were low and mostly non-significant, stronger patterns may be observed at higher levels of hierarchical structure.

  2. Spatially Explicit Analysis: We will decompose spatial structure among the sampling locations into the full range of scales to attempt to identify the smallest spatial structures that show evidence of population genetic structure in the ddRADseq data.

  3. Statistical Learning: Subtle polygenic structure among sampling locations may reveal population structure that univariate approaches may miss. Methods like random forest and other statistical learning algorithms have demonstrated greater sensitivity to subtle population genetic structure than methods like PCA and STRUCTURE.
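As a sketch of the statistical learning idea, a random forest can be trained to classify individuals to sampling location from their genotypes; out-of-bag classification accuracy above chance would suggest polygenic structure. This is illustrative only: it assumes a mean-imputed genotype matrix like the `X` created in the Data section below, and uses the randomForest package, which is not loaded by this notebook.

```r
# sketch: can a random forest assign individuals to sampling locations?
# assumes X = tab(genind, freq = TRUE, NA.method = "mean") and pop(genind)
library(randomForest)
rf <- randomForest(x = X, y = pop(genind), ntree = 1000, importance = TRUE)
rf$confusion     # out-of-bag confusion among a priori populations
varImpPlot(rf)   # loci contributing most to any discrimination
```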

Data

Here we read in the genetic data.

There are many versions of the SNP dataset. For a first-pass RDA we’re going to use one without PSV or LD filtering.

vcf <- read.vcfR("input_data/populations.snps.vcf")
## Scanning file to determine attributes.
## File attributes:
##   meta lines: 14
##   header_line: 15
##   variant count: 6624
##   column count: 334
## Meta line 14 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 6624
##   Character matrix gt cols: 334
##   skip: 0
##   nrows: 6624
##   row_num: 0
## Processed variant 1000
## Processed variant 2000
## Processed variant 3000
## Processed variant 4000
## Processed variant 5000
## Processed variant 6000
## Processed variant: 6624
## All variants processed
genind <- vcfR2genind(vcf)

#now lets get the population information encoded
genind$pop <- as.factor(str_sub(row.names(genind$tab), 1, 2))

#convert to genpop format 
genpop <- genind2genpop(genind)
## 
##  Converting data from a genind to a genpop object... 
## 
## ...done.
#lets collect just the north pacific
genind_north <- genind[pop(genind) != "TS" & pop(genind) != "NC"]

k-means and DAPC

As a first pass, we’ll attempt to cluster the data using k-means, then visualize differences among genetic clusters using DAPC. This shouldn’t work well given k-means’ limited power with such subtle structure, but it’s low-hanging fruit. Below we look for any populations that tend to cluster by k-means.

kmeans <- find.clusters(genind_north, n.pca = 246, n.clust = 2)
kable(table(pop(genind_north), kmeans$grp), caption = "a priori population vs genetic cluster" )
a priori population vs genetic cluster

      1   2
BA    4   0
BC   12   0
CN   19   0
CS   40   0
HN   29   0
HW   29   0
JP   22   0
OR   34   0
PH   28   2
WA   28   0
kmeans <- find.clusters(genind_north, n.pca = 246, n.clust = 3)
kable(table(pop(genind_north), kmeans$grp), caption = "a priori population vs genetic cluster" )
a priori population vs genetic cluster

      1   2   3
BA    0   4   0
BC    0  12   0
CN    0  19   0
CS    0  40   0
HN    0  29   0
HW    0  29   0
JP    0  22   0
OR    0  32   2
PH    0  30   0
WA    2  26   0
kmeans <- find.clusters(genind_north, n.pca = 246, n.clust = 4)
kable(table(pop(genind_north), kmeans$grp), caption = "a priori population vs genetic cluster" )
a priori population vs genetic cluster

      1   2   3   4
BA    2   2   0   0
BC    5   5   0   2
CN   11   8   0   0
CS   20  20   0   0
HN   16  11   0   2
HW    5  22   0   2
JP   15   7   0   0
OR   19   9   2   4
PH   17  13   0   0
WA   15  13   0   0
kmeans <- find.clusters(genind_north, n.pca = 246, n.clust = 5)
kable(table(pop(genind_north), kmeans$grp), caption = "a priori population vs genetic cluster" )
a priori population vs genetic cluster

      1   2   3   4   5
BA    0   2   0   0   2
BC    4   5   0   0   3
CN    0  11   0   0   8
CS    2  24   0   0  14
HN    0  16   0   2  11
HW    8  18   0   0   3
JP    0   8   0   0  14
OR    4  12   2   0  16
PH    2  14   0   0  14
WA    2   8   0   0  18

Nope; it appears that k-means just continuously splits clusters across sampling locations, without any clusters that correspond to clear spatial structures among sampling locations.

Spatial Analyses

Methods Summary

  1. RDA: Determine if we can find a decomposition of spatial structure that significantly explains genetic structure in the dataset OTHER THAN the North/South spatial structure successfully identified in the Vaux manuscript.

Spatial Variables

Here we generate the spatial variables from the raw location data. We take two approaches. In the first, we use what is most commonly applied when working with population genetic data: distance-based Moran’s eigenvectors. In the second, we fully explore a set of graphs (spatial neighborhoods) connecting sampling locations to build the spatial weighting matrix that best captures genetic variance. The latter may be more powerful if spatial structures other than IBD-based autocorrelation drive structure.

We also generate the spatial variables for both the full dataset, and a north pacific only dataset.

dbMEMs

The spatial variable is a distance-based Moran’s eigenvector map (dbMEM). dbMEM decomposes the spatial structure into a set of orthogonal eigenfunctions, thereby providing better resolution of autocorrelation or local structures.

# First let's get the rough lat and lons for the dataset
# just manually made a table from Vaux based on sampling map (no location information was provided in ms or supplemental materials)

sites <- read_tsv("input_data/rough_locations.txt")

#now let's focus just on NPAC samples
Nsites <- sites[c(1:10),]

# now lets get the individual level dataset
sites_big <- data.frame(pop = genind$pop)
sites_big <- sites_big %>%
  left_join(sites, by = c("pop" = "Pop"))
Nsites_big <- sites_big %>%
  filter(pop != "NC" & pop != "TS")

#first convert all lat lon to geodetic cartesian coords

#anti-meridian causes an issue, let's fix it
sites_big<- sites_big %>%
  mutate(Longitude = if_else(Longitude < 0, Longitude + 360, Longitude))
Nsites_big<- Nsites_big %>%
  mutate(Longitude = if_else(Longitude < 0, Longitude + 360, Longitude))

#now calc geodetic coords
require(SoDA)
XY_big <- geoXY(sites_big[,3], sites_big[,4], unit =1000)
NXY_big  <- geoXY(Nsites_big[,3], Nsites_big[,4], unit =1000)

#add jitter
XY_big[,1] <- jitter(XY_big[,1])
XY_big[,2]<- jitter(XY_big[,2])

NXY_big[,1] <- jitter(NXY_big[,1])
NXY_big[,2]<- jitter(NXY_big[,2])


#now calculate dbMEMs (spatial autocorrelation)
dbmems <- dbmem(XY_big, MEM.autocor = "all")
Ndbmems <- dbmem(NXY_big, MEM.autocor = "all")

Now that dbmems are made let’s take a look at them.

# first eigenvalues/Moran's
barplot(attr(dbmems, "values"), 
        main = "Eigenvalues of the spatial weighting matrix\nProportional to Moran's I\nFull Dataset", cex.main = 0.7)

barplot(attr(Ndbmems, "values"), 
        main = "Eigenvalues of the spatial weighting matrix\nProportional to Moran's I\nNorth Pacific Only Dataset", cex.main = 0.7)

For the full dataset only 4 dbMEMs have a positive Moran’s I (i.e. positive spatial autocorrelation), while the NPac has 2. Next let’s test these to see if they are worth keeping (i.e. whether Moran’s I is greater than 0 under a permutation test). Then we’ll plot them in space and see what we have.

#first recompute dbmems keeping only positive to speed up rand test (no need to test ones we're not going to use)

dbmems <- dbmem(XY_big, store.listw = TRUE)
Ndbmems <- dbmem(NXY_big, store.listw = TRUE)

test <- moran.randtest(dbmems, nrepet = 99)
test
## class: krandtest lightkrandtest 
## Monte-Carlo tests
## Call: moran.randtest(x = dbmems, nrepet = 99)
## 
## Number of tests:   4 
## 
## Adjustment method for multiple comparisons:   none 
## Permutation number:   99 
##   Test         Obs    Std.Obs   Alter Pvalue
## 1 MEM1 0.798788040 161.234645 greater   0.01
## 2 MEM2 0.417907085 110.695867 greater   0.01
## 3 MEM3 0.103438553  29.145564 greater   0.01
## 4 MEM4 0.006030934   2.186919 greater   0.07
test <- moran.randtest(Ndbmems, nrepet = 99)
test
## class: krandtest lightkrandtest 
## Monte-Carlo tests
## Call: moran.randtest(x = Ndbmems, nrepet = 99)
## 
## Number of tests:   2 
## 
## Adjustment method for multiple comparisons:   none 
## Permutation number:   99 
##   Test        Obs    Std.Obs   Alter Pvalue
## 1 MEM1 0.49692989 190.300028 greater   0.01
## 2 MEM2 0.01110166   4.367928 greater   0.02

If we use an uncorrected p < 0.1 cutoff, the full dataset has 4 significant dbMEMs and the NPac has 2.

Now let’s plot these in space to see what they look like. First the set of MEMs for the North Pacific

# we'll use marmap to make some nice looking plots of the pacific and color sampling locations by their dbMEM values

# North Pacific First
#first get the marmap data
require(marmap)
Npac_bathy <- getNOAA.bathy(lon1 = 130, lon2 = -116, lat1 = 11, lat2 = 53, resolution = 10, antimeridian = TRUE)

#now get the mem and sampling sites together
Nmem_plot <- as.data.frame(cbind(Ndbmems, Nsites_big))
#fix the antimeridian problem
Nmem_plot<- Nmem_plot %>%
  mutate(Longitude = if_else(Longitude < 0, Longitude + 360, Longitude))

#NPAC

#mem1
autoplot(Npac_bathy, geom = c("r")) +
  scale_fill_gradientn(values = scales::rescale(c(-6600, -10, 1, 2200)),
                       colors = c("steelblue4", "#C7E0FF", "grey50", "grey80")) +
  geom_point(aes(x = Longitude, y = Latitude, color = MEM1), data = Nmem_plot, size = 3) +
  theme(legend.position = "none") +
  scale_color_viridis_c(option = "A") +
  ggtitle("NPAC MEM1")

#mem2
autoplot(Npac_bathy, geom = c("r")) +
  scale_fill_gradientn(values = scales::rescale(c(-6600, -10, 1, 2200)),
                       colors = c("steelblue4", "#C7E0FF", "grey50", "grey80")) +
  geom_point(aes(x = Longitude, y = Latitude, color = MEM2), data = Nmem_plot, size = 3) +
  theme(legend.position = "none") +
  scale_color_viridis_c(option = "A") +
  ggtitle("NPAC MEM2")

Next the set of MEMs for the Full Pacific.

pac_bathy <- getNOAA.bathy(lon1 = 130, lon2 = -116, lat1 = -50, lat2 = 53, resolution = 10, antimeridian = TRUE)

#now get the mem and sampling sites together
mem_plot <- as.data.frame(cbind(dbmems, sites_big))
#fix the antimeridian problem
mem_plot<- mem_plot %>%
  mutate(Longitude = if_else(Longitude < 0, Longitude + 360, Longitude))

#PAC

#mem1
autoplot(pac_bathy, geom = c("r")) +
  scale_fill_gradientn(values = scales::rescale(c(-6600, -10, 1, 2400)),
                       colors = c("steelblue4", "#C7E0FF", "grey50", "grey80")) +
  geom_point(aes(x = Longitude, y = Latitude, color = MEM1), data = mem_plot, size = 3) +
  theme(legend.position = "none") +
  scale_color_viridis_c(option = "A") +
  ggtitle("PAC MEM1")

#mem2
autoplot(pac_bathy, geom = c("r")) +
  scale_fill_gradientn(values = scales::rescale(c(-6600, -10, 1, 2400)),
                       colors = c("steelblue4", "#C7E0FF", "grey50", "grey80")) +
  geom_point(aes(x = Longitude, y = Latitude, color = MEM2), data = mem_plot, size = 3) +
  theme(legend.position = "none") +
  scale_color_viridis_c(option = "A") +
  ggtitle("PAC MEM2")

#mem3
autoplot(pac_bathy, geom = c("r")) +
  scale_fill_gradientn(values = scales::rescale(c(-6600, -10, 1, 2400)),
                       colors = c("steelblue4", "#C7E0FF", "grey50", "grey80")) +
  geom_point(aes(x = Longitude, y = Latitude, color = MEM3), data = mem_plot, size = 3) +
  theme(legend.position = "none") +
  scale_color_viridis_c(option = "A") +
  ggtitle("PAC MEM3")

#mem4
autoplot(pac_bathy, geom = c("r")) +
  scale_fill_gradientn(values = scales::rescale(c(-6600, -10, 1, 2400)),
                       colors = c("steelblue4", "#C7E0FF", "grey50", "grey80")) +
  geom_point(aes(x = Longitude, y = Latitude, color = MEM4), data = mem_plot, size = 3) +
  theme(legend.position = "none") +
  scale_color_viridis_c(option = "A") +
  ggtitle("PAC MEM4")

cRDA (redundancy analysis of components)

Goals:
For RDAs, the primary goal is to determine if we can find a decomposotion of spatial structure that significantly explains genetic structure in the dataset OTHER THAN the North/South structure successfully identified in the Vaux manuscript. If we identify structure the variable loadings from the RDAs can also be used for initial feature selection.

Note on dataset: Here we run the first-pass, fastest possible analysis using Moran’s eigenvector maps. We collapse genotypic data among individuals into principal components of genetic variation (thus cRDA). Note that larger analyses such as RDA on individual genotypes may have greater power and allow loadings on individual loci, but if this faster analysis discovers population substructure, then we can be encouraged to continue to these locus-level analyses.

Genetic Data

To fit the cRDA we need to get the PCs of genetic variation for each individual and the dbmems into a dataframe.

First the PCs for the full dataset

#first replace missing data with mean frequency
X <- tab(genind, freq = TRUE, NA.method = "mean")

#then run pca
pca1 <- dudi.pca(X, scale = FALSE, scannf = FALSE, nf = 324)

#check pcs to keep with kaiser-guttman

#kaiser guttman
cutoff<-mean(pca1$eig)
kg <- length((pca1$eig)[(pca1$eig)>cutoff])
barplot(pca1$eig, main = "PCA eigenvalues")
abline(h = cutoff, col = "red")

#kept 149 PCs that exceed the Kaiser-Guttman criterion
snp_pcs <- pca1$li[,c(1:kg)]

Using the Kaiser-Guttman criterion, we kept the first 149 PCs of the full SNP dataset.

Now let’s take a look at these populations in PC space.

#now plot data
pca_plot_data <- as.data.frame(cbind(genind$pop, snp_pcs))
pca_plot_data <- pca_plot_data %>%
  rename(pop = "genind$pop")
ggplot(data = pca_plot_data)+geom_point(aes(Axis1, Axis2, color = pop)) + stat_ellipse(aes(Axis1, Axis2, color = pop)) +theme_classic()

ggplot(data = pca_plot_data)+geom_point(aes(Axis1, Axis3, color = pop)) + stat_ellipse(aes(Axis1, Axis3, color = pop)) +theme_classic()

No surprises here; same results as the Vaux ms. Fst is too low for PCA to discriminate among pops within a gyre, but it does cluster SPac samples apart from NPac, and there are some samples with admixture between these two clusters. The only difference is that there are some strong outliers among the Tasmania samples; potentially these are due to some of the filtering parameters being different (check later).

Now let’s do the same, just for the north pacific samples.

##### now just the north samples

#first replace missing data with mean frequency
Xn <- tab(genind_north, freq = TRUE, NA.method = "mean")

#then run pca
pca_north <- dudi.pca(Xn, scale = FALSE, scannf = FALSE, nf = 245)

#check pcs to keep with kaiser-guttman

#kaiser guttman
cutoff<-mean(pca_north$eig)
kg <- length((pca_north$eig)[(pca_north$eig)>cutoff])
barplot(pca_north$eig, main = "PCA eigenvalues")
abline(h = cutoff, col = "red")

#kept 117 PCs that exceed the Kaiser-Guttman criterion
Nsnp_pcs <- pca_north$li[,c(1:kg)]

#now plot data
Npca_plot_data <- as.data.frame(cbind(genind_north$pop, Nsnp_pcs))
Npca_plot_data <- Npca_plot_data %>%
  rename(pop = "genind_north$pop")
ggplot(data = Npca_plot_data)+geom_point(aes(Axis1, Axis2, color = pop)) + stat_ellipse(aes(Axis1, Axis2, color = pop)) +theme_classic()

ggplot(data = Npca_plot_data)+geom_point(aes(Axis3, Axis4, color = pop)) + stat_ellipse(aes(Axis3, Axis4, color = pop)) +theme_classic()

Again, no surprises. This time we kept 117 PCs to fit with RDA later. No evidence of structure within the North Pacific.

Full Dataset cRDA

Now that we have gathered all of our variables, we’ll fit the cRDA using the full dataset

Global Test and Variable Selection
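The chunk that produced the output below is not echoed in the rendered notebook. A sketch of the presumed code, reconstructed by analogy with the North Pacific version later in this notebook (object names are assumptions):

```r
# global RDA: does any spatial structure explain genetic structure?
rda_null <- rda(snp_pcs ~ 1, data = dbmems)
rda_full <- rda(snp_pcs ~ ., data = dbmems)

# adjusted R2 of the full model (printed below)
adjR2.rda_full <- RsquareAdj(rda_full)$adj.r.squared
adjR2.rda_full

# three forward-selection approaches
ordiR2 <- ordiR2step(rda_null, scope = formula(rda_full, data = dbmems))
ordi   <- ordistep(rda_null, scope = formula(rda_full, data = dbmems))
rda.fs <- forward.sel(Y = snp_pcs, X = dbmems, adjR2thresh = adjR2.rda_full)
```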

## [1] 0.01363899
## Step: R2.adj= 0 
## Call: snp_pcs ~ 1 
##  
##                  R2.adjusted
## <All variables> 0.0136389906
## + MEM1          0.0085735766
## + MEM2          0.0028312054
## + MEM4          0.0016057208
## + MEM3          0.0005018099
## <none>          0.0000000000
## 
##        Df  AIC      F Pr(>F)   
## + MEM1  1 1924 3.8019  0.002 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step: R2.adj= 0.008573577 
## Call: snp_pcs ~ MEM1 
##  
##                 R2.adjusted
## <All variables> 0.013638991
## + MEM2          0.011440201
## + MEM4          0.010210910
## + MEM3          0.009103571
## <none>          0.008573577
## 
##        Df    AIC      F Pr(>F)   
## + MEM2  1 1924.1 1.9366  0.002 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step: R2.adj= 0.0114402 
## Call: snp_pcs ~ MEM1 + MEM2 
##  
##                 R2.adjusted
## <All variables>  0.01363899
## + MEM4           0.01309157
## + MEM3           0.01198078
## <none>           0.01144020
## 
##        Df    AIC      F Pr(>F)   
## + MEM4  1 1924.5 1.5388  0.002 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step: R2.adj= 0.01309157 
## Call: snp_pcs ~ MEM1 + MEM2 + MEM4 
##  
##                 R2.adjusted
## + MEM3           0.01363899
## <All variables>  0.01363899
## <none>           0.01309157
## 
## Start: snp_pcs ~ 1 
## 
##        Df    AIC      F Pr(>F)   
## + MEM1  1 1924.0 3.8019  0.005 **
## + MEM2  1 1925.9 1.9199  0.005 **
## + MEM4  1 1926.3 1.5211  0.005 **
## + MEM3  1 1926.7 1.1627  0.025 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step: snp_pcs ~ MEM1 
## 
##        Df    AIC      F Pr(>F)   
## - MEM1  1 1925.8 3.8019  0.005 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##        Df    AIC      F Pr(>F)   
## + MEM2  1 1924.1 1.9366  0.005 **
## + MEM4  1 1924.5 1.5343  0.005 **
## + MEM3  1 1924.8 1.1728  0.040 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step: snp_pcs ~ MEM1 + MEM2 
## 
##        Df    AIC      F Pr(>F)   
## - MEM2  1 1924.0 1.9366  0.005 **
## - MEM1  1 1925.9 3.8129  0.005 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##        Df    AIC      F Pr(>F)   
## + MEM4  1 1924.5 1.5388  0.005 **
## + MEM3  1 1924.9 1.1762  0.050 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step: snp_pcs ~ MEM1 + MEM2 + MEM4 
## 
##        Df    AIC      F Pr(>F)   
## - MEM4  1 1924.1 1.5388  0.005 **
## - MEM2  1 1924.5 1.9399  0.005 **
## - MEM1  1 1926.4 3.8193  0.005 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##        Df    AIC      F Pr(>F)  
## + MEM3  1 1925.3 1.1782  0.035 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step: snp_pcs ~ MEM1 + MEM2 + MEM4 + MEM3 
## 
##        Df    AIC      F Pr(>F)   
## - MEM3  1 1924.5 1.1782  0.010 **
## - MEM4  1 1924.9 1.5396  0.005 **
## - MEM2  1 1925.3 1.9410  0.005 **
## - MEM1  1 1927.2 3.8214  0.005 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Testing variable 1
## Testing variable 2
## Testing variable 3
## Testing variable 4
## MEM1 MEM2 MEM3 MEM4 
##    1    1    1    1

All variable selection procedures agree on dropping MEM3, keeping all others.

Final Model

Now that we have established that the full model is significant and performed variable selection to identify the spatial structures, let’s build the final model.

#final model
#rda_final <- rda(snp_pcs ~ MEM3 + MEM2 + MEM7 + MEM9 + MEM6 , data = mems)
rda_final <- rda(snp_pcs ~ MEM1 + MEM2 + MEM4, data = dbmems)

#testing signficance of constrained axes (how many are interesting)
rda_axis <- anova.cca(rda_final, by = 'axis', parallel = 3)
rda_axis$p.adj <- p.adjust (rda_axis$`Pr(>F)`, method = 'holm')
rda_axis
#testing significance of explanatory variables by margin (each conditioned on all the others)
rda_expl <-  anova.cca(rda_final, by = 'margin', parallel = 3)
rda_expl$p.adj <- p.adjust (rda_expl$`Pr(>F)`, method = 'holm')
rda_expl$pve<- rda_expl$Variance/sum(rda_expl$Variance)

rda_expl

Permutation analyses suggest that the first two redundant axes significantly correlate patterns in the spatial dataset with patterns in the genetic dataset. Separate permutations also show that the selected MEMs significantly explain the PCs.

## screeplot

constrained_percent <- rda_final$CCA$eig/rda_final$tot.chi*100
unconstrained_percent <- rda_final$CA$eig/rda_final$tot.chi*100
barplot (c(constrained_percent, unconstrained_percent), col = c(rep ('red', length (constrained_percent)), rep ('black', length (unconstrained_percent))), las = 2, ylab = '% variation')
Screeplot of RDA - red is constrained axes, black is unconstrained axes


Full Pacific cRDA Results Summary

## tri plot
#site score
rda_sum<-summary(rda_final, axes = c(2))
scores <- as.data.frame(rda_sum$sites)
scores$pop<-genind$pop


#species scores
arrows<-as.data.frame(rda_sum$biplot)

#plot by pop
ggplot() +
  geom_point(aes(RDA1, RDA2, color = pop), data = scores) +
  geom_segment(data = arrows, aes(x = 0, y = 0, xend = RDA1, yend = RDA2),
               arrow = arrow(length = unit(0.2, "cm")), color = "black") +
  geom_text(data = arrows, aes(x = RDA1 + RDA1*0.3, y = RDA2 + RDA2*0.3,
                               label = c("MEM1", "MEM2", "MEM4")),
            size = 4, color = "black") +
  stat_ellipse(aes(RDA1, RDA2, color = pop), data = scores) +
  theme_classic() +
  xlab("Redundant Axis 1\n1.5% of Total Variance") +
  ylab("Redundant Axis 2\n0.48% of Total Variance") +
  scale_color_discrete() #+scale_color_viridis_d(option = "A")

We fit a global model of all variables. This full model was significant. We then performed forward variable selection; three approaches yielded the same result, and we built the final model on MEMs 1, 2 and 4.

Only the first 2 RDA axes are significant (i.e. they explain significantly more than the equivalent unconstrained axes; Holm-adjusted p < 0.05). All explanatory variables (MEMs) in the final model are also significant (i.e. each variable explains significantly more variance once the data are conditioned on all other variables; Holm-adjusted p < 0.05).

Collectively these suggest that there is a significant relationship between genetic variation and spatial autocorrelation (although note this model did not attempt to parse population structure from autocorrelation). The first two constrained axes captured ~2.3% of the total variance in the model. Since only the first 149 PCs of genetic variation were included, we cannot directly relate this back to FST, but a napkin calculation would suggest we fit about half of the total variance in the dataset. This would bring these results into close alignment with the Fst attributed to North vs South population genetic structure.
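The napkin calculation can be made explicit by rescaling the constrained variance by the fraction of total SNP variance the retained PCs capture. A sketch using objects already in the workspace (treat the result as rough, not a formal Fst estimate):

```r
# fraction of total SNP variance captured by the 149 retained PCs
pc_fraction <- sum(pca1$eig[1:149]) / sum(pca1$eig)

# constrained variance as a share of the PC-reduced dataset
constrained_share <- sum(rda_final$CCA$eig) / rda_final$tot.chi

# rough share of variance in the original SNP dataset explained by space
constrained_share * pc_fraction
```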

When we examine the spatial structures in the RDA triplot, we can see that MEMs 1 and 2 load strongly onto the first redundant axis, and this axis separates North from South Pacific populations. This is not surprising given the results from the Vaux ms.

However, the second redundant axis provides very interesting results that suggest significant spatial structures that “are redundant with” genetic variance within the North Pacific. Specifically, redundant axis 2 is driven almost exclusively by MEM4, and this MEM (while distance-based and mostly separating Hawaii from all other samples) also maps strongly onto a N-S gradient within the west coast NPac samples. Indeed, when we examine the tri-plot we can observe clinal variation along the West Coast samples. This pattern (RDA2 separates N-S pops and Hawaii) may be due to (a) introgression tails from the southern hemisphere (as suggested by the Vaux ms), (b) spatial autocorrelation produced by IBD within the North Pacific, or (c) an adaptive genetic cline. Parsing (c) from (a and b) may be possible with an environmental association analysis, but parsing (a) and (b) may be challenging without haplotypic data. Conducting the analysis on the North Pacific-only samples (below) may also clarify these patterns.

North Pacific cRDA

Global Test and Variable Selection

#global rda
Nrda_null <- rda(Nsnp_pcs ~ 1, data = Ndbmems)
Nrda_full <- rda(Nsnp_pcs ~ . , data = Ndbmems)


#check that the full model is significant
anova(Nrda_full) # 0.002 yes it is significant - proceed to variable selection
#what the variance explained
NadjR2.rda_full <- RsquareAdj(Nrda_full)$adj.r.squared
NadjR2.rda_full
## [1] 0.0009799238
#variable selection
# here we do variable selection with three different forward selection approaches
NordiR2 <- ordiR2step(Nrda_null, scope = formula(Nrda_full, data = Ndbmems))
## Step: R2.adj= 0 
## Call: Nsnp_pcs ~ 1 
##  
##                  R2.adjusted
## <All variables> 0.0009799238
## + MEM2          0.0007409189
## + MEM1          0.0002350053
## <none>          0.0000000000
## 
##        Df    AIC      F Pr(>F)   
## + MEM2  1 1583.9 1.1824  0.002 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step: R2.adj= 0.0007409189 
## Call: Nsnp_pcs ~ MEM2 
##  
##                  R2.adjusted
## <All variables> 0.0009799238
## + MEM1          0.0009799238
## <none>          0.0007409189
## 
##        Df    AIC      F Pr(>F)  
## + MEM1  1 1584.8 1.0586  0.064 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
NordiR2$anova  
Nordi <- ordistep(Nrda_null, scope = formula(Nrda_full, data = Ndbmems)) 
## 
## Start: Nsnp_pcs ~ 1 
## 
##        Df    AIC      F Pr(>F)   
## + MEM2  1 1583.9 1.1824  0.005 **
## + MEM1  1 1584.0 1.0578  0.065 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step: Nsnp_pcs ~ MEM2 
## 
##        Df    AIC      F Pr(>F)   
## - MEM2  1 1583.1 1.1824  0.005 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##        Df    AIC      F Pr(>F)  
## + MEM1  1 1584.8 1.0586  0.085 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Nordi$anova
Nrda.fs <- forward.sel(Y = Nsnp_pcs, X = Ndbmems, adjR2thresh = NadjR2.rda_full)
## Testing variable 1
## Testing variable 2
## Procedure stopped (alpha criteria): pvalue for variable 2 is 0.056000 (> 0.050000)
Nrda.fs 

The global model is significant; variable selection keeps both MEMs.

Final Model

Now that we have established that the full model is significant and performed variable selection to identify the spatial structures, let’s build the final model.

#final model
Nrda_final <- rda(Nsnp_pcs ~ MEM1 + MEM2, data = Ndbmems)

#testing signficance of constrained axes (how many are interesting)
rda_axis <- anova.cca(Nrda_final, by = 'axis', parallel = 3)
rda_axis$p.adj <- p.adjust (rda_axis$`Pr(>F)`, method = 'holm')
rda_axis
#testing significance of explanatory variables by margin (each conditioned on all the others)
rda_expl <-  anova.cca(Nrda_final, by = 'margin', parallel = 3)
rda_expl$p.adj <- p.adjust (rda_expl$`Pr(>F)`, method = 'holm')
rda_expl$pve<- rda_expl$Variance/sum(rda_expl$Variance)

rda_expl

Keep both RDA axes and all MEMs.

North Pacific cRDA Results Summary

## tri plot
#site score
Nrda_sum<-summary(Nrda_final, axes = c(2))
Nscores <- as.data.frame(Nrda_sum$sites)
Nscores$pop<-genind_north$pop


#species scores
Narrows<-as.data.frame(Nrda_sum$biplot)

#plot by pop
ggplot() +
  geom_point(aes(RDA1, RDA2, color = pop), data = Nscores) +
  geom_segment(data = Narrows, aes(x = 0, y = 0, xend = RDA1, yend = RDA2),
               arrow = arrow(length = unit(0.2, "cm")), color = "black") +
  geom_text(data = Narrows, aes(x = RDA1 + RDA1*0.3, y = RDA2 + RDA2*0.3,
                                label = c("MEM1", "MEM2")),
            size = 4, color = "black") +
  stat_ellipse(aes(RDA1, RDA2, color = pop), data = Nscores) +
  theme_classic() +
  xlab("Redundant Axis 1\n0.48% of Total Variance") +
  ylab("Redundant Axis 2\n0.44% of Total Variance") +
  scale_color_discrete() #+scale_color_viridis_d(option = "A")

We fit a global model of all variables. This full model was significant. We then performed forward variable selection; three approaches yielded the same result, and we built the final model on MEMs 1 and 2.

Only the first 2 RDA axes are significant (i.e. they explain significantly more than the equivalent unconstrained axes; Holm-adjusted p < 0.05). All explanatory variables (MEMs) in the final model are also significant (i.e. each variable explains significantly more variance once the data are conditioned on all other variables; Holm-adjusted p < 0.05).

Collectively these suggest that there is a significant relationship between genetic variation and spatial autocorrelation at the tested scales. The first two constrained axes captured ~0.9% of the total variance in the model. Since only the first 117 PCs of genetic variation were included, we cannot directly relate this back to FST.

When we examine the RDA triplot we can see that MEMs 1 and 2 load strongly onto different redundant axes. Three clusters of populations are apparent: eastern (west coast of North America), central (Hawaii), and western (Japan + Philippines). RDA1 (and MEM2) separates all three clusters, with the western cluster intermediate, while RDA2 (and MEM1) groups the eastern and central clusters together and separates the western cluster. Taken together, it is possible to explain a small portion of the genetic variation in the North Pacific as a result of spatial autocorrelation among these three clusters, although there may be other demographic and selective scenarios that can produce similar patterns (see discussion below).

cRDA Discussion

Both cRDAs show that there are spatial structures at a variety of scales that can explain genetic variation within the Pacific. However, synthesizing the results from the North Pacific dataset and the full dataset raises some questions.

(1) Is there a N-S cline within the NPac samples, and if so, why don’t we see it in the NPac cRDA?
Since our MEMs are dbMEMs, they are a spectral decomposition of the distances between populations. No dbMEMs in the North Pacific dataset captured this N-S gradient. I’m not sure why the finest-scale spatial autocorrelation was not captured by a MEM with positive Moran’s I. Perhaps using a different spatial neighborhood to generate MEMs would capture this pattern, or we could just test for it directly by isolating these samples, or by fitting latitude.

In any case, the absence of the pattern in the North Pac cRDA is most easily explained by the fact that we did not test for it. An explicit test of latitude as an explanation for genetic variation is possible, and we can even parse this from spatial autocorrelation per se with yet another RDA or other canonical ordination approach.

(2) Is this overfit?

Maybe.

A lesson from population genomics on marine organisms is that we have so much data that we tend to find patterns whenever we look for them. While a lot of authors attribute this to the efficiency of selection in large populations coupled with abundant environmental gradients, overfitting is also a concern. Several lines of evidence suggest these results should be robust to overfitting: (1) we only kept principal components of genetic variation that exceeded a criterion, rather than fitting on the full dataset, and (2) the analysis should be resistant to overfitting in the traditional sense because n (samples) >> p (predictors: MEMs), and we performed variable selection on the tiny number of predictors we did use. However, there’s enough variation in the response variables (SNP PCs) that the patterns we are modeling in this dataset may not be true in nature (i.e. our procedures for avoiding overfitting are all focused on the explanatory variable side). Only validation with new data can confirm this. Alternatively, we could attempt some type of cross-validation approach. This might be worthwhile before selecting SNPs.
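One possible shape for such a cross-validation, sketched only (the 80/20 split and the use of `predict` on new MEM values are assumptions, not an analysis we ran):

```r
# sketch: hold-out check for the full-Pacific cRDA
set.seed(1)
train <- sample(nrow(snp_pcs), floor(0.8 * nrow(snp_pcs)))

# refit the final model on the training individuals only
rda_cv <- rda(snp_pcs[train, ] ~ MEM1 + MEM2 + MEM4, data = dbmems[train, ])

# predict constrained (lc) site scores for held-out individuals from their MEMs,
# then compare against their observed positions in PC space
pred <- predict(rda_cv, newdata = dbmems[-train, ], type = "lc")
```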

(3) Admixture Cline?

We might be able to parse the N-S cline. However, the most powerful MEMs are always the ones capturing N-S structure, so it seems likely that latitude is a better predictor of the genetic data than spatial autocorrelation per se. We can test this by explicitly including latitude and then performing variance partitioning / pRDA. Because the SWM seems to mostly capture E-W distance, this approach might actually have some power.
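The variance-partitioning step could look like the sketch below, which splits variation between a raw latitude-like coordinate and the spatial eigenvectors using vegan’s varpart. Built-in mite data again stand in for our objects; `vp` is an illustrative name.

```r
library(vegan)
data(mite)
data(mite.xy)
data(mite.pcnm)
mite.h <- decostand(mite, "hellinger")

# split variation between a raw latitude-like coordinate (y) and the
# spatial eigenvectors: [a] = pure gradient, [c] = pure spatial
# autocorrelation, [b] = their shared component, [d] = residual
vp <- varpart(mite.h, ~ y, as.matrix(mite.pcnm[, 1:5]), data = mite.xy)
vp
plot(vp, bg = c("tomato", "steelblue"))
```

A large pure-gradient fraction [a] relative to [c] would support latitude, rather than autocorrelation per se, as the driver.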

Hierarchical Structure

Methods Summary

Treefit?

Statistical Learning

Methods Summary

Dropped Analyses

This section collects code from analyses that we did not end up using.

MEMs

The dbMEMs are distance based and represent a decomposition of the distance matrix. However, we know more than just distances, and we might be interested in spatial weighting matrices built from other spatial neighborhoods/graphs of connections among populations. In other words, there may be spatial structures other than simple distance-based autocorrelation that we would like to fit with an RDA, but we cannot test for them if our dbMEMs do not capture them. So we use functionality in the adespatial package to choose a spatial weighting matrix that maximizes adjusted R².

library(adespatial); library(sp); library(spdep)
#build a connection network among sites and convert it to a spatial weights list
nb <- chooseCN(coordinates(sitesmat), type = 1, plot.nb = TRUE)
lw <- nb2listw(nb, style = 'W', zero.policy = TRUE)

#Moran's eigenvector maps from the chosen spatial weighting matrix
mems1 <- mem(lw)

barplot(attr(mems1, "values"))
ggplot() +
  geom_point(aes(x = sitesxy$Longitude, y = sitesxy$Latitude, color = mems1$MEM5), size = 3) +
  theme_classic() +
  scale_color_viridis_c(option = "A")


dbmems2 <- dbmem(sitesmat, store.listw = TRUE, silent = FALSE, MEM.autocor = "all", thresh = 100 )

barplot(attr(dbmems2, "values"))


#get sites matrix for full dataset
sites_big <- left_join(pca_plot_data, sites, by = c("pop" = "Pop"))[, c(249, 248)]
#add jitter so individuals from the same site do not have identical coordinates
sites_big$Longitude <- jitter(sites_big$Longitude)
sites_big$Latitude <- jitter(sites_big$Latitude)
sites_big <- as.matrix(sites_big)

#build list of candidate spatial weighting matrices
cands <- listw.candidates(sites_big, nb = c("gab", "rel", "mst"), weights = c("bin", "flin"))

# choose best spatial weighting matrix
best.lw <- listw.select(snp_pcs[,c(1:117)], candidates = cands, nperm = 10, nperm.global = 99, MEM.all = TRUE, method = "global")
#rda with mems from delauney triangulation

mems_big <- left_join(pca_plot_data, as.data.frame(cbind(mems1, sites$Pop)), by = c("pop" = "sites$Pop"))
mems_big <- mems_big[,247:255]
#global rda
rda_null <- rda(snp_pcs ~ 1, data = mems_big)
rda_full <- rda(snp_pcs ~ . , data = mems_big)


#check that the full model is significant
anova(rda_full) # 0.002 yes it is significant - proceed to variable selection

#what is the variance explained?
adjR2.rda_full <- RsquareAdj(rda_full)$adj.r.squared
adjR2.rda_full


#variable selection
# here we do variable selection with three different forward selection approaches
ordiR2 <- ordiR2step(rda_null, scope = formula(rda_full, data = mems_big))
ordiR2$anova #MEMs 3,2,7,9 and 6 retained by forward selection 
ordi <- ordistep(rda_null, scope = formula(rda_full, data = mems_big)) 
ordi$anova # also adds MEM1
rda.fs <- forward.sel(Y = snp_pcs, X = mems_big, adjR2thresh = adjR2.rda_full)
rda.fs # same as ordiR2


#final model
#rda_final <- rda(snp_pcs ~ MEM3 + MEM2 + MEM7 + MEM9 + MEM6 , data = mems)
rda_final <- rda(snp_pcs ~ MEM1 + MEM8 + MEM7 + MEM2 + MEM6 + MEM3 + MEM9, data = mems_big)

#testing significance of constrained axes (how many are interesting)
rda_axis <- anova.cca(rda_final, by = 'axis', parallel = 3)
rda_axis$p.adj <- p.adjust(rda_axis$`Pr(>F)`, method = 'holm')
rda_axis

#testing significance of explanatory variables by margin (each conditioned on all the others)
rda_expl <- anova.cca(rda_final, by = 'margin', parallel = 3)
rda_expl$p.adj <- p.adjust(rda_expl$`Pr(>F)`, method = 'holm')
rda_expl$pve <- rda_expl$Variance / sum(rda_expl$Variance)

rda_expl


# RDA restricted to the positive-autocorrelation MEMs

rda_pos <- rda(snp_pcs ~ MEM1 + MEM2 + MEM3, data = mems_big)
rda_null <- rda(snp_pcs ~ 1, data = mems_big)
ordiR2 <- ordiR2step(rda_null, scope = formula(rda_pos, data = mems_big))
ordiR2$anova
rda_pos <- rda(snp_pcs ~  MEM2 + MEM3, data = mems_big)


#testing significance of constrained axes (how many are interesting)
rda_axis <- anova.cca(rda_pos, by = 'axis', parallel = 3)
rda_axis$p.adj <- p.adjust(rda_axis$`Pr(>F)`, method = 'holm')
rda_axis

#testing significance of explanatory variables by margin (each conditioned on all the others)
rda_expl <- anova.cca(rda_pos, by = 'margin', parallel = 3)
rda_expl$p.adj <- p.adjust(rda_expl$`Pr(>F)`, method = 'holm')
rda_expl$pve <- rda_expl$Variance / sum(rda_expl$Variance)

rda_expl

## tri plot
#site score
rda_sum <- summary(rda_pos, axes = c(2))
scores <- as.data.frame(rda_sum$sites)
scores$pop <- cRDA_data$pop


#species scores
arrows<-as.data.frame(rda_sum$biplot)

#plot by pop
ggplot() +
  geom_point(aes(RDA1, RDA2, color = pop), data = scores) +
  geom_segment(data = arrows, aes(x = 0, y = 0, xend = RDA1, yend = RDA2),
               arrow = arrow(length = unit(0.2, "cm")), color = "black") +
  geom_text(data = arrows, aes(x = RDA1 + RDA1 * 0.3, y = RDA2 + RDA2 * 0.3,
                               label = c("MEM2", "MEM3")), size = 4, color = "black") +
  stat_ellipse(aes(RDA1, RDA2, color = pop), data = scores) +
  theme_classic() +
  xlab("Redundant Axis 1\n0.64% of Total Variance") +
  ylab("Redundant Axis 2\n0.61% of Total Variance") +
  scale_color_discrete()
#plot mem2 and mem3 on a map together

#get data
mem_plot <- as.data.frame(cbind(sites_big, mems_big))
#fix antimeridian issue
mem_plot<- mem_plot %>%
  mutate(Longitude = if_else(Longitude < 0, Longitude + 360, Longitude))

#first make the two MEM maps separately (dropped in favor of the combined plot below)
# gp1 <- autoplot(pac_bathy, geom=c("r")) + scale_fill_gradientn(values = scales::rescale(c(-6600, -10, 1, 2200)), colors = c("steelblue4", "#C7E0FF", "grey50", "grey80"))+geom_point(aes(x = Longitude, y = Latitude, color = MEM2), data = mem_plot, size = 3)+theme(legend.position = "none")
# gp2 <- autoplot(pac_bathy, geom=c("r")) + scale_fill_gradientn(values = scales::rescale(c(-6600, -10, 1, 2200)), colors = c("steelblue4", "#C7E0FF", "grey50", "grey80"))+geom_point(aes(x = Longitude, y = Latitude, color = MEM3), data = mem_plot, size = 3)+theme(legend.position = "none")

mem_plot$color <- rgb(blue = scales::rescale(mem_plot$MEM2), red = scales::rescale(mem_plot$MEM3), green = 0)
autoplot(pac_bathy, geom=c("r")) + scale_fill_gradientn(values = scales::rescale(c(-6600, -10, 1, 2200)), colors = c("steelblue4", "#C7E0FF", "grey50", "grey80"))+geom_point(aes(x = Longitude, y = Latitude, color = color), data = mem_plot, size = 3)+theme(legend.position = "none")+scale_color_identity()


autoplot(pac_bathy, geom=c("r")) + scale_fill_gradientn(values = scales::rescale(c(-6600, -10, 1, 2200)), colors = c("steelblue4", "#C7E0FF", "grey50", "grey80"))+geom_point(aes(x = Longitude, y = Latitude, color = MEM2), data = mem_plot, size = 3)+theme(legend.position = "none")+scale_color_viridis_c(option = "A")


autoplot(pac_bathy, geom=c("r")) + scale_fill_gradientn(values = scales::rescale(c(-6600, -10, 1, 2200)), colors = c("steelblue4", "#C7E0FF", "grey50", "grey80"))+geom_point(aes(x = Longitude, y = Latitude, color = MEM3), data = mem_plot, size = 3)+theme(legend.position = "none")+scale_color_viridis_c(option = "A")